Heuristics for chemical compound matching.
Abstract
We have developed an efficient algorithm for comparing two chemical compounds, in which each chemical structure is treated as a 2D graph with atoms as vertices and covalent bonds as edges. Based on the concept of functional groups in chemistry, 68 atom types (vertex types) are defined for carbon, nitrogen, oxygen, and other atomic species in different environments, which enables detection of biochemically meaningful features. Maximal common subgraphs of two graphs can be found by searching for maximal cliques in the association graph, and we have introduced heuristics to accelerate the clique finding. The heuristic procedure is controlled by several adjustable parameters. We applied the procedure to the latest KEGG/LIGAND database with different parameter sets and demonstrated how the parameters in our algorithm correlate with the distribution of similarity scores and/or the execution time. Finally, we showed the effectiveness of our heuristics for compound pairs along metabolic pathways.
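The link between common subgraphs and cliques mentioned in the abstract can be illustrated with a small sketch. This is not the authors' heuristic procedure: it is a plain, exact Bron-Kerbosch enumeration on a toy example, and it assumes element symbols as vertex labels in place of the 68 atom types. The function names `association_graph` and `maximal_cliques` are hypothetical, introduced only for illustration.

```python
from itertools import combinations


def association_graph(labels1, edges1, labels2, edges2):
    """Build the association graph of two vertex-labeled graphs.

    Nodes are pairs (u, v) whose labels match. Two nodes (u1, v1) and
    (u2, v2) are adjacent when u1 != u2, v1 != v2, and u1-u2 is bonded
    in graph 1 exactly when v1-v2 is bonded in graph 2.
    """
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    nodes = [(u, v) for u in labels1 for v in labels2
             if labels1[u] == labels2[v]]
    adj = {n: set() for n in nodes}
    for (u1, v1), (u2, v2) in combinations(nodes, 2):
        if u1 == u2 or v1 == v2:
            continue
        if (frozenset((u1, u2)) in e1) == (frozenset((v1, v2)) in e2):
            adj[(u1, v1)].add((u2, v2))
            adj[(u2, v2)].add((u1, v1))
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques with Bron-Kerbosch and pivoting."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        # Pivot on the node with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda n: len(adj[n] & p))
        for n in list(p - adj[pivot]):
            expand(r | {n}, p & adj[n], x & adj[n])
            p = p - {n}
            x = x | {n}

    expand(set(), set(adj), set())
    return cliques


# Toy example (heavy atoms only, labels are element symbols by assumption):
# ethanol C-C-O versus dimethyl ether C-O-C.
ethanol_labels = {0: "C", 1: "C", 2: "O"}
ethanol_edges = [(0, 1), (1, 2)]
ether_labels = {0: "C", 1: "O", 2: "C"}
ether_edges = [(0, 1), (1, 2)]

adj = association_graph(ethanol_labels, ethanol_edges,
                        ether_labels, ether_edges)
cliques = maximal_cliques(adj)
largest = max(cliques, key=len)  # one atom mapping of the common C-O fragment
```

Each maximal clique corresponds to an atom-to-atom mapping that preserves both labels and bonding, so the largest clique here maps the shared C-O fragment. Exact enumeration of this kind grows quickly with molecule size, which is the cost that heuristics such as those described in the abstract aim to reduce.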
Journal:
- Genome informatics. International Conference on Genome Informatics
Volume: 14
Publication date: 2003